Technical Brief

The jmzQuantML programming interface and validator
for the mzQuantML data standard

Da Qi, Ritesh Krishna and Andrew R Jones 
Institute of Integrative Biology, University of Liverpool, UK

The mzQuantML standard from the HUPO Proteomics Standards Initiative has recently 
been released, capturing quantitative data about peptides and proteins, following analysis 
of MS data. We present a Java application programming interface (API) for mzQuantML 
called jmzQuantML. The API provides robust bridges between Java classes and ele- 
ments in mzQuantML files and allows random access to any part of the file. The API 
provides read and write capabilities, and is designed to be embedded in other soft- 
ware packages, enabling mzQuantML support to be added to proteomics software tools 
(http://code.google.com/p/jmzquantml/). The mzQuantML standard is designed around a
multilevel validation system to ensure that files are structurally and semantically correct for 
different proteomics quantitative techniques. In this article, we also describe a Java software 
tool (http://code.google.com/p/mzquantml-validator/) for validating mzQuantML files, which
is a formal part of the data standard. 
The two most common tasks performed in proteomic research
are the identification or quantification of proteins or
peptides, using MS followed by data analysis. The Proteomics 
Standards Initiative (PSI) has been working for a number of 
years to develop data standards to assist data sharing, software 
development, and database submissions for the different data 
types produced in these typical workflows. The PSI has re- 
leased mzML for raw MS data or peak lists [1], mzIdentML 
for peptide and protein identification, for example exported 
from a search engine [2], TraML [3] for encoding transition 
lists and associated metadata, and, recently, mzQuantML for 
quantitative data [4]. The model is developed as an Extensible 
Markup Language (XML) Schema Definition file, accompa- 
nied by controlled vocabulary (CV) terms and definitions as 
part of the PSI-MS CV [5], also used in mzML, mzIdentML, 
and TraML. The use of correct CV terms within the schema 
is governed by a mapping file, defining the terms that MUST,
SHOULD, or MAY (formal keywords) be used at particular locations
within a file.
Four quantitative proteomics techniques are currently
supported in mzQuantML version 1.0: (i) intensity-based
(MS1) label-free, (ii) MS1 label-based (such as SILAC or
15N), (iii) MS2 tag-based (iTRAQ or tandem mass tag),
and (iv) spectral counting. A review of software support
for quantitative proteomics can be found in [6]. All the
latest example files can be found at
http://code.google.com/p/mzquantml/source/browse/#svn/trunk/examples/version1.0.
Support for SRM is in progress and will
soon be submitted as an update to the current specifications.
Each of these techniques is represented using the same 
core mzQuantML structures, but the types of structures that 
must be used for each technique are governed by a set of 
"semantic rules." The semantic rules are written in natural 
language and define the elements that MUST, SHOULD, or 
MAY appear in the file for each technique along with some 
conditions. 

The jmzQuantML application programming interface
(API) follows a design similar to that of jmzML [7]
and jmzIdentML [8]. It is developed in Java, making
it platform independent, and uses open source frameworks
such as Java Architecture for XML Binding (JAXB), Maven 2,
Log4j, and JUnit. The API is released under the Apache
Software License 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
The two recommended ways of using
jmzQuantML are downloading the Java Archive (JAR) file
and manually importing it into a new Java project,
or exploiting Maven's dependency mechanism (details
can be found at http://code.google.com/p/jmzquantml/).
Full Java documentation of the API is available from
http://jmzquantml.googlecode.com/svn/trunk/src/docs/api/index.html.

Figure 1. Diagram of the steps for marshalling and unmarshalling
mzQuantML files using the jmzQuantML API.

Proteomic data represented in mzQuantML can produce 
large files, since the format is able to capture quantita- 
tive data about proteins or protein groups, peptides, and 
features (regions of 2D LC-MS space), as well as all ref- 
erences between these elements to ensure a full trace of 
the steps performed by analysis software is retrievable. 
As such, if software packages attempted to load complete
mzQuantML files into memory, they could exhaust the
available memory. A technique called the xxindex
(http://code.google.com/p/pride-toolsuite/wiki/XXIndex) is
employed in jmzQuantML to allow random access to any
XML element in an mzQuantML file via its XPath (the hierarchy
and location of an element within an XML file). With this
technique, which is also used in jmzML and jmzIdentML,
the jmzQuantML API can extract individual elements via the
xxindexer (an index recording both the start and end byte
positions of each element) from large files stored on disk quickly
and with low memory overhead. The same technique can
also be used to build large files.
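The indexing idea can be sketched in a few lines of Java. The following is a minimal illustration under stated assumptions, not the xxindex implementation: the class name ByteRangeIndexSketch, the naive tag scan, and the toy XML content are all hypothetical. It records the start and end byte positions of each occurrence of a named element, then reads a single element back with RandomAccessFile without touching the rest of the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a byte-range index; this is not the xxindex code. */
final class ByteRangeIndexSketch {

    /** Start (inclusive) and end (exclusive) byte offsets of one element. */
    static final class ByteRange {
        final long start;
        final long end;
        ByteRange(long start, long end) { this.start = start; this.end = end; }
    }

    /**
     * Naive scan for non-nested, non-self-closing <tag ...>...</tag> elements.
     * For brevity the file is read once in full; a real indexer would stream it.
     */
    static List<ByteRange> index(Path file, String tag) throws IOException {
        byte[] data = Files.readAllBytes(file);
        byte[] open = ("<" + tag).getBytes(StandardCharsets.UTF_8);
        byte[] close = ("</" + tag + ">").getBytes(StandardCharsets.UTF_8);
        List<ByteRange> ranges = new ArrayList<>();
        int pos = 0;
        while ((pos = indexOf(data, open, pos)) >= 0) {
            int after = pos + open.length;
            // Reject longer names that merely start with the tag (e.g. FeatureList).
            boolean exactName = after < data.length
                    && (data[after] == ' ' || data[after] == '>');
            if (!exactName) { pos = after; continue; }
            int end = indexOf(data, close, after);
            if (end < 0) break;
            end += close.length;
            ranges.add(new ByteRange(pos, end));
            pos = end;
        }
        return ranges;
    }

    /** Reads exactly one indexed element from disk by seeking to its offset. */
    static String read(Path file, ByteRange range) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buffer = new byte[(int) (range.end - range.start)];
            raf.seek(range.start);
            raf.readFully(buffer);
            return new String(buffer, StandardCharsets.UTF_8);
        }
    }

    private static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = from; i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Toy content only; real mzQuantML elements carry more attributes.
        String xml = "<MzQuantML><FeatureList>"
                + "<Feature id=\"f1\">500.1</Feature>"
                + "<Feature id=\"f2\">612.3</Feature>"
                + "</FeatureList></MzQuantML>";
        Path file = Files.createTempFile("sketch", ".mzq");
        Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
        List<ByteRange> ranges = index(file, "Feature");
        System.out.println(read(file, ranges.get(1)));
        Files.delete(file);
    }
}
```

Once the index exists, the cost of fetching an element depends on the size of that element rather than on the size of the file, which is the property the API relies on for large files.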

Reference resolving is another important feature of the
API. Many elements within mzQuantML files contain references
to other objects via a unique identifier. The handling
of references within jmzQuantML is controlled by a configuration
file. The initial stages of reference resolving are performed
while the xxindexer scans the file
to create the XPath-based index. Depending on the user
preference stored in the configuration file, the API either
resolves a reference by creating a full Java object
of the referenced element (called autoRefResolving) or stores
only the unique (String) identifier of the referenced element
(called idMapped). The API ships with a default configuration file
in which reference resolving is enabled for all objects.
To save memory, it may be desirable
to turn off the reference resolving mechanism for elements
expected to contain a large number of references to other
objects, such as <PeptideConsensus> elements (representing
peptides quantified across replicates) referencing <Feature>
elements (representing 2D regions of LC-MS space that have
been quantified).
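The trade-off between the two modes can be illustrated with a small sketch. The classes below (RefResolvingSketch, Feature, Consensus) and the boolean switch are hypothetical stand-ins invented for this example; they do not reproduce jmzQuantML's configuration file or its model classes. The point is only that the id-only mode stores strings, while the resolving mode additionally holds a full object per reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of resolved versus id-only reference handling. */
final class RefResolvingSketch {

    /** Stand-in for a referenced element. */
    static final class Feature {
        final String id;
        final double mz;
        Feature(String id, double mz) { this.id = id; this.mz = mz; }
    }

    /** Stand-in for a referencing element: always keeps ids, optionally objects. */
    static final class Consensus {
        final List<String> featureRefs = new ArrayList<>();
        final List<Feature> features = new ArrayList<>(); // filled only when resolving
    }

    /**
     * Builds a referencing object. With autoResolve the referenced objects are
     * looked up and attached; without it only the identifiers are stored.
     */
    static Consensus load(List<String> refs, Map<String, Feature> byId, boolean autoResolve) {
        Consensus consensus = new Consensus();
        for (String ref : refs) {
            consensus.featureRefs.add(ref);
            if (autoResolve) {
                Feature feature = byId.get(ref);
                if (feature == null) {
                    throw new IllegalStateException("Unresolved reference: " + ref);
                }
                consensus.features.add(feature);
            }
        }
        return consensus;
    }

    public static void main(String[] args) {
        Map<String, Feature> byId = new LinkedHashMap<>();
        byId.put("f1", new Feature("f1", 500.1));
        byId.put("f2", new Feature("f2", 612.3));
        List<String> refs = Arrays.asList("f1", "f2");

        Consensus resolved = load(refs, byId, true);
        Consensus idOnly = load(refs, byId, false);
        System.out.println("resolved objects: " + resolved.features.size());
        System.out.println("id-only objects:  " + idOnly.features.size());
    }
}
```

With id-only handling the caller fetches a referenced element on demand, so the memory cost of an element with thousands of references stays close to the cost of the identifier strings.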

We have generated some example code to demonstrate
how to use jmzQuantML for marshalling and unmarshalling
mzQuantML files, which is available from
http://code.google.com/p/jmzquantml/w/list. The API implements
two main functions for mzQuantML files: marshalling and
unmarshalling. Marshalling is the process of writing Java
objects as XML fragments to an mzQuantML file stored on
disk, whereas unmarshalling is the reverse: reading
XML fragments from an mzQuantML file into memory
as Java objects. These functions are provided by two main
classes: MzQuantMLMarshaller and MzQuantMLUnmarshaller.
Each class provides more than one way of
performing its operation, to give more flexibility to users.
Figure 1 illustrates the
workflow of marshalling and unmarshalling. Marshalling
an mzQuantML object to a file is more straightforward
than unmarshalling. The user first creates an MzQuantMLMarshaller
object, then follows one of two
paths: via the MzQuantML object (Path A in the marshalling
flowchart in Fig. 1) or via a FileWriter (Path B in Fig. 1). Path
A requires the user to create an empty MzQuantML object 
as the root element to which all other element objects are 
attached. The element objects are generated by the JAXB 
compiler directly from the mzQuantML schema definition 
file. Each element object provides setter and getter methods 
for populating data into correct and valid attribute types. 
The relationship between the mzQuantML elements and the 
Java objects is illustrated in Fig. 2. All the required element 
objects are added to the root object before marshalling the 
whole object to an mzQuantML file. This method is only
recommended for marshalling small files, as the entire object
tree is loaded into memory. For large files, API users
are recommended to follow the FileWriter method (Path B).
By using a FileWriter, no memory issues occur and files of
almost any size can be produced. The marshalling process in
Path B works by writing each required block of elements
directly to file; for example, the process starts by writing a start
tag (MzQuantMLMarshaller.createMzQuantMLStartTag)
and finishes by writing a closing tag
(MzQuantMLMarshaller.createMzQuantMLClosingTag). This method of
file writing requires API users to be aware of the order of
elements that must be written to create a valid mzQuantML
file. The recommended order can be found by inspecting
the mzQuantML XML Schema. For jmzQuantML users
following Path A, the order of element writing is not
important.

Figure 2. An example illustrating the relationship between the
<ProteinList> element in the XML Schema Definition and the ProteinList
object class in jmzQuantML.
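The block-by-block writing pattern of Path B can be sketched independently of the API. The example below is an assumption-laden illustration: StreamingWriteSketch and writeDocument are hypothetical names, the blocks are pre-serialised strings, the root tag is written without the attributes a real file needs, and the block order shown is illustrative rather than the order required by the schema. It does not call MzQuantMLMarshaller, whose method signatures are not reproduced here.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;

/** Hypothetical sketch of streaming XML blocks to a Writer, one at a time. */
final class StreamingWriteSketch {

    /**
     * Writes a start tag, then each block in the order supplied, then the
     * closing tag. Only one block needs to be held in memory at any moment,
     * so the caller, not this method, is responsible for a valid block order.
     */
    static void writeDocument(Writer out, Iterable<String> blocks) throws IOException {
        out.write("<MzQuantML>\n");      // start tag first (attributes omitted)
        for (String block : blocks) {
            out.write(block);            // block can be discarded once written
            out.write('\n');
        }
        out.write("</MzQuantML>\n");     // closing tag last
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // A StringWriter keeps the demo self-contained; a FileWriter would be
        // used in the same way to stream to disk.
        StringWriter out = new StringWriter();
        writeDocument(out, Arrays.asList("<CvList/>", "<AnalysisSummary/>"));
        System.out.print(out);
    }
}
```

Because the blocks arrive through an Iterable, they can be generated lazily, for example one list of quantified features at a time, which is what keeps the memory footprint flat for large outputs.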

The unmarshalling workflow is shown in the left panel
of Fig. 1. An MzQuantMLUnmarshaller object must be
created to read from the given file. The class equips the
user with some useful functions to explore an mzQuantML
file (e.g. getMzQuantMLId, getMzQuantMLName, and
getMzQuantMLVersion). The unmarshal method can
be invoked with one of three parameter types: (i)
a named object type within the MzQuantMLElement
class (e.g. MzQuantMLElement.AuditCollection), (ii) an
XPath string describing the location of the element
in the file (e.g. /MzQuantML/AnalysisSummary),
or (iii) the class name of an element object (e.g.
uk.ac.liv.jmzqml.model.mzqml.CvList.class). It returns
a Java object of the corresponding element type, with its attributes
populated from the values in the file. For objects that occur multiple
times in a file, jmzQuantML provides a method
(unmarshalCollectionFromXpath) for retrieving a list of
objects that can be iterated over. Through any of these
options, jmzQuantML processes the data in the file, returning
Java objects to the user of the API for onward processing
in the local application.
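To make the distinction between retrieving a single element and retrieving a collection by XPath concrete, the sketch below uses the JDK's own javax.xml.xpath API as a stand-in. It is not the jmzQuantML unmarshaller: the class XPathLookupSketch, its methods, and the toy document are assumptions, and unlike the indexed approach described above this stand-in parses the whole document into memory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical stand-in showing XPath-addressed single and list lookups. */
final class XPathLookupSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the first element at the given XPath, or null if none exists. */
    static Element one(Document doc, String xpath) throws Exception {
        XPath evaluator = XPathFactory.newInstance().newXPath();
        return (Element) evaluator.evaluate(xpath, doc, XPathConstants.NODE);
    }

    /** Returns every element at the given XPath as an iterable list. */
    static List<Element> all(Document doc, String xpath) throws Exception {
        XPath evaluator = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) evaluator.evaluate(xpath, doc, XPathConstants.NODESET);
        List<Element> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add((Element) nodes.item(i));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Toy document without namespaces; element content is simplified.
        Document doc = parse("<MzQuantML id=\"demo\"><AnalysisSummary/>"
                + "<CvList><Cv id=\"PSI-MS\"/><Cv id=\"UO\"/></CvList></MzQuantML>");
        System.out.println(one(doc, "/MzQuantML/AnalysisSummary").getTagName());
        for (Element cv : all(doc, "/MzQuantML/CvList/Cv")) {
            System.out.println(cv.getAttribute("id"));
        }
    }
}
```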

Although mzQuantML version 1.0 has only been a PSI 
standard since March 2013, there are already several software 
packages supporting the format, which use jmzQuantML in- 
ternally, such as ProteoSuite (http://www.proteosuite.org/), 
x-Tracker (http://www.x-tracker.info/) and the Progenesis 
Post-Processor [9]. A tutorial describing how to develop soft- 
ware using PSI standards can be found in [10]. 

There is no guarantee that an mzQuantML file created by 
the jmzQuantML API is semantically valid, as the schema is 
designed with flexibility and future extensibility in mind and
mzQuantML requires multilevel validation, which is outside
the scope of a typical API. Specifically, as part of the formal 
mzQuantML standard, we require formal validation software 
to check three levels of validation: (i) a file is valid against the 
XML Schema, (ii) correct CV terms are used in the correct 
locations of the file, and (iii) the semantic rules are correctly 
fulfilled. 
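The first of these levels, validation against an XML Schema, can be illustrated with the standard javax.xml.validation API. The sketch below is a generic example and not the mzQuantML validator: the class SchemaCheckSketch and the tiny inline schema are assumptions made so that the example is self-contained; in practice the mzQuantML schema definition file would be supplied instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Hypothetical sketch of level (i): checking a document against a schema. */
final class SchemaCheckSketch {

    // Toy schema: a Doc element with a required version attribute and
    // one or more Item children. It stands in for a real schema file.
    static final String TOY_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"Doc\"><xs:complexType><xs:sequence>"
          + "<xs:element name=\"Item\" type=\"xs:string\" maxOccurs=\"unbounded\"/>"
          + "</xs:sequence>"
          + "<xs:attribute name=\"version\" type=\"xs:string\" use=\"required\"/>"
          + "</xs:complexType></xs:element></xs:schema>";

    /** Returns true if the document is well formed and valid against TOY_XSD. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(TOY_XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed XML or a schema violation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Doc version=\"1.0\"><Item>a</Item></Doc>"));
        System.out.println(isValid("<Doc><Item>a</Item></Doc>"));
    }
}
```

Schema validation of this kind catches structural problems only; it cannot tell whether a CV term is appropriate for its location or whether the combination of elements makes sense for a given technique, which is why the second and third levels are needed.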

As an example of the CV validation requirement, the 
<AnalysisSummary> element in each file must contain 
a CV term specifying the quantitative technique used for 
the experiment, e.g. "LC-MS label-free quantitation analysis"
(MS:1001834) or "MS1 label-based analysis" (MS:1002018).
These CV mapping rules are captured in an XML file as a
formal part of the standard, defining which CV terms MUST,
SHOULD, or MAY be present at each location in the file that
has the potential to be parameterized. As an example of a semantic
rule, for an MS2 tagging-style file (e.g. iTRAQ data):
"If PeptideConsensusList is present there MUST be a FeatureList
present and there MUST be an MS2AssayQuantLayer
present." The rule states the general mzQuantML structures
that must be present in an iTRAQ-style file; if they are
absent, a fatal error will result.
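A rule of this kind translates naturally into a conditional presence check. The sketch below encodes the quoted rule using the JDK's DOM parser; it is an illustration only and does not reflect how the mzQuantML validator or the PSI validator framework implement rules. The class SemanticRuleSketch, the message wording, and the toy documents are assumptions; the element names are taken from the rule text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch of one semantic rule expressed as a presence check. */
final class SemanticRuleSketch {

    private static boolean has(Document doc, String tag) {
        return doc.getElementsByTagName(tag).getLength() > 0;
    }

    /**
     * Applies the rule: if PeptideConsensusList is present, FeatureList and
     * MS2AssayQuantLayer MUST also be present. Returns one message per
     * violation; an empty list means the rule is satisfied.
     */
    static List<String> check(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> errors = new ArrayList<>();
        if (has(doc, "PeptideConsensusList")) {
            if (!has(doc, "FeatureList")) {
                errors.add("ERROR: PeptideConsensusList present but FeatureList missing");
            }
            if (!has(doc, "MS2AssayQuantLayer")) {
                errors.add("ERROR: PeptideConsensusList present but MS2AssayQuantLayer missing");
            }
        }
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Toy documents: only element presence matters for this rule.
        String incomplete = "<MzQuantML><PeptideConsensusList/></MzQuantML>";
        for (String message : check(incomplete)) {
            System.out.println(message);
        }
    }
}
```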

We have developed an mzQuantML validator
(http://code.google.com/p/mzquantml-validator/) to ensure that software
packages export data in a consistent way. The validator
checks if an mzQuantML file is syntactically and semantically
valid and has used appropriate CV terms. The validator
is implemented using jmzQuantML and the PSI validator
framework [11]. The validator comes with a basic graphical 
interface, as it is intended as a tool for developers rather 
than laboratory scientists. When a user loads a file, the val- 
idator will process the file against the different rules, and 
export messages at different levels of severity. Since XML 
schema validation (using any known method) can be slow 
for very large mzQuantML files, the validator allows the user 
to skip schema validation and process semantic validation 
rules only if desired. We have tested the validator with file 
sizes up to 150 MB, which takes approximately 1 min on a 
standard desktop PC without schema validation and several 
hours with schema validation. The validator is particularly
relevant for software developers adding mzQuantML export
capabilities to data analysis software: the mzQuantML
specifications are large and complex, and without robust validation
software there is a danger of producing invalid files.
We also provide valid and invalid test files in our repository
for users to try. The header of each file contains information
about the expected errors. We encourage users to run the
validator on these files and see how it reports the
errors encountered.

We present two Java-based open source resources,
the jmzQuantML API and the mzQuantML validator, to support
the new mzQuantML standard from the HUPO-PSI. The
jmzQuantML API follows the successful design pattern employed
in APIs for other PSI standards that are now in common
use in a variety of software toolkits, enabling rapid development
of import and export capabilities. The mzQuantML
validator will be an essential resource for all developers wishing
to add mzQuantML export capabilities to their software
packages.